From Old Texts to Modern Spellings: An Experiment in Automatic Normalisation

Authors

  • Iris Hendrickx
  • Rita Marquilhas
Abstract

We aim to tackle the problem of spelling variations in a corpus of personal Portuguese letters from the 16th to the 20th century. We investigated the extent to which the task of normalising Portuguese spelling can be accomplished automatically. We adapted VARD2 (Baron and Rayson, 2008), a statistical tool for normalising spelling, for use with the Portuguese language and studied its performance over four different time periods. Our results showed that VARD2 performed best on the older letters and worst on the most modern ones. In an extrinsic evaluation, we measured the usefulness of automatic normalisation for the linguistic task of automatic POS-tagging and showed that automatic normalisation of spelling helps improve the performance of the POS-tagger.

Similar resources

Rule-based search in historical text databases - Visualization techniques

The project Rule-Based Search in Historical Databases with Non-Standard Spellings (RSNSR, Pilz et al. 2005) will provide an online-available search-engine that can be used by interested amateurs as well as professional linguists. Parallel to the implementation of a customizable software architecture to support an efficient search functionally recalling all relevant historical spellings of a mod...

Developing an automated semantic analysis system for Early Modern English

As reported by Wilson and Rayson (1993) and Rayson and Wilson (1996), the UCREL semantic analysis system (USAS) has been designed to undertake the automatic semantic analysis of present-day English (henceforth PresDE) texts. In this paper, we report on the feasibility of (re)training the USAS system to cope with English from earlier periods, specifically the Early Modern English (henceforth Emo...

The influence of graphotactic knowledge on adults' learning of spelling.

Three experiments investigated whether and how the learning of spelling by French university students is influenced by the graphotactic legitimacy of the spellings. Participants were exposed to three types of novel spellings: AB, which do not contain doublets (e.g., guprane); AAB, with a doublet before a single consonant, which is legitimate in French (e.g., gupprane); and ABB, with a doublet a...

LGeRM: lemmatization of Middle French words

Unlike most modern languages, Middle French is a language whose spelling is not yet stabilized. There is a great deal of variation in the spelling of a word and accordingly the traditional methods for lemmatization cannot be used. LGeRM (lemmes, graphies et règles morphologiques) proposes a solution based on a databank containing known lemmatized spellings and a set of graphical and morphologic...

Mainstreaming August Strindberg with Text Normalization

This article explores the application of text normalization methods based on Levenshtein distance and Statistical Machine Translation to the literary genre, specifically on the collected works of August Strindberg. The goal is to normalize archaic spellings to modern day spelling. The study finds evidence of success in text normalization, and explores some problems and improvements to the proce...
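As a rough illustration of the Levenshtein-distance idea mentioned in this abstract, the following minimal sketch maps a historical spelling to the closest entry in a modern word list. It is not the method of the article or of VARD2; the function names, the distance threshold, and the three-word lexicon are assumptions made up for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalise(token: str, lexicon: list[str], max_distance: int = 2) -> str:
    """Return the nearest lexicon word, or the token itself if none is close.

    `max_distance` is an illustrative threshold, not a value from the source.
    """
    if token in lexicon:
        return token
    best = min(lexicon, key=lambda word: levenshtein(token, word))
    return best if levenshtein(token, best) <= max_distance else token


# Hypothetical toy lexicon of modern spellings (illustration only).
LEXICON = ["vad", "kvinna", "och"]

if __name__ == "__main__":
    for old in ["hvad", "qvinna", "xyzxyz"]:
        print(old, "->", normalise(old, LEXICON))
```

A pure nearest-neighbour lookup like this ignores context and word frequency, which is presumably one reason such approaches are combined with statistical models in practice.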

Journal title:
  • JLCL

Volume 26  Issue 

Pages  -

Publication date 2011